(54) Automatic insertion of non-verbalized punctuation in speech recognition



(57) Recognizing punctuation in computer-imple- 
mented speech recognition includes performing speech 
recognition on an utterance to produce a recognition re- 
sult for the utterance. A non-verbalized punctuation 
mark is identified in a recognition result and the recog- 
nition result is formatted based on the identification. 
Description

[0001] This description relates to automatic insertion 
of non-verbalized punctuation in speech recognition. 
[0002] A speech recognition system analyzes a user's 
speech to determine what the user said. Most speech 
recognition systems are frame-based. In a frame-based 
system, a processor divides a signal descriptive of the 
speech to be recognized into a series of digital frames, 
each of which corresponds to a small time increment of 
the speech. 

[0003] A speech recognition system may be a "dis- 
crete" system that recognizes discrete words or phrases 
but which requires the user to pause briefly between 
each discrete word or phrase. Alternatively, a speech 
recognition system may be a "continuous" system that 
can recognize spoken words or phrases irrespective of 
whether the user pauses between them. 
[0004] In general, the processor of a continuous 
speech recognition system analyzes "utterances" of 
speech. An utterance includes a variable number of 
frames and corresponds, for example, to a period of 
speech followed by a pause of at least a predetermined 
duration. 

[0005] The processor determines what the user said 
by finding sequences of words that jointly fit the acoustic 
model and language model and best match the digital 
frames of an utterance. An acoustic model may corre- 
spond to a word, a phrase, or a command from a vocab- 
ulary. An acoustic model also may represent a sound, 
or phoneme, that corresponds to a portion of a word. 
Collectively, the constituent phonemes for a word represent
the phonetic spelling of the word. Acoustic models also may
represent silence and various types of environmental noise.

[0006] The words or phrases corresponding to the 
best matching acoustic models are referred to as rec- 
ognition candidates. The processor may produce a sin- 
gle recognition candidate for an utterance, or may pro- 
duce a list of recognition candidates. In producing the 
recognition candidates, the processor may make use of 
a language model that accounts for the frequency at 
which words typically are used in relation to one another. 
[0007] In one general aspect, recognizing punctua- 
tion in computer-implemented speech recognition in- 
cludes performing speech recognition on an utterance 
to produce a recognition result for the utterance. A non- 
verbalized punctuation mark is identified in a recognition 
result and the recognition result is formatted based on 
the identification. 

[0008] Implementations may include one or more of 
the following features. For example, the non-verbalized 
punctuation mark may be identified by predicting the 
non-verbalized punctuation mark using at least one text 
feature and at least one acoustic feature related to the 
utterance. The acoustic feature may include a period of 
silence, a function of pitch of words near the period of 
silence, an average pitch of words near the period of
silence, and/or a ratio of an average pitch of words near
the period of silence.

[0009] The recognition result may be formatted by 
controlling or altering spacing relative to the non-verbalized
punctuation mark. The recognition result may be
formatted by controlling or altering capitalization of 
words relative to the non-verbalized punctuation mark. 
[0010] In one implementation, the non-verbalized
punctuation mark may include a period and the recognition
result may be formatted by inserting an extra
space after the period and capitalizing a next word
following the period.

[0011] A portion of the recognition result that includes
the non-verbalized punctuation mark may be selected
for correction and that portion of the recognition result
may be corrected with one of a number of correction
choices. At least one of the correction choices may include
a change to the non-verbalized punctuation mark.
At least one of the correction choices may not include
the non-verbalized punctuation mark.

[0012] In another general aspect, correcting incorrect
text associated with recognition errors in computer-implemented
speech recognition may include performing
speech recognition on an utterance to produce a recognition
result for the utterance. A portion of the recognition
result that includes the non-verbalized punctuation
may be selected for correction and that portion of the
recognition result may be corrected with one of a
number of correction choices.

[0013] Implementations may include one or more of
the following features. For example, at least one of the
correction choices may include a change to the non-verbalized
punctuation. At least one of the correction choices
may not include the non-verbalized punctuation. The
non-verbalized punctuation may include a non-verbalized
punctuation mark. The non-verbalized punctuation
may be changed and text surrounding the non-verbalized
punctuation may be reformatted to be grammatically
consistent with the changed non-verbalized punctuation.
The changes to the non-verbalized punctuation
and reformatting of the text may be in response to a single
user action.

[0014] In another general aspect, recognizing punctuation
in computer-implemented speech recognition
dictation may include performing speech recognition on
an utterance to produce a recognition result for the utterance.
A non-verbalized punctuation mark may be
identified in the recognition result and it may be determined
where to insert the non-verbalized punctuation
mark within the recognition result based on the identification
using at least one text feature and at least one
acoustic feature related to the utterance to predict where
to insert the non-verbalized punctuation mark.
[0015] Implementations may include one or more of
the following features. For example, the acoustic feature
may include a period of silence, a function of pitch of
words near the period of silence, an average pitch of
words near the period of silence, and/or a ratio of an
average pitch of words near the period of silence.

EP 1 422 692 A2

[0016] In another general aspect, a graphical user interface
for correcting incorrect text associated with recognition
errors in computer-implemented speech recognition
may include a window to display a selected recognition
result including non-verbalized punctuation associated
with an utterance. The graphical user interface
also includes a list of recognition alternatives with at 
least one of the recognition alternatives including a 
change to the non-verbalized punctuation and associ- 
ated adjustments in spacing and other punctuation. 
[0017] Implementations may include one or more of 
the following features. For example, the non-verbalized 
punctuation may include a period. The non-verbalized 
punctuation may include a comma. 
[0018] In one implementation, the change to the non- 
verbalized punctuation may include a change from a pe- 
riod to a comma and the associated adjustments in 
spacing and other punctuation may include removing a 
space after the comma and uncapitalizing a word fol- 
lowing the comma. In another implementation, the 
change to the non-verbalized punctuation may include 
a change from a comma to a period. The associated adjustments
in spacing and other punctuation may include
adding a space after the period and capitalizing a word 
following the period. 
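
By way of illustration only, the period/comma changes described in paragraphs [0010] and [0018] may be sketched in Python as follows. The function name `change_punctuation` and the convention of two spaces after a period and one after a comma are assumptions made for this example and are not taken from the described system. The sketch is naive in that it lowercases the next word unconditionally, so proper nouns and the pronoun "I" are not handled.

```python
def change_punctuation(text: str, index: int, new_mark: str) -> str:
    """Replace the mark at text[index] and adjust spacing and capitalization.

    Assumed convention (hypothetical): a period is followed by two spaces
    and a capitalized word; a comma by one space and a lowercase word.
    """
    if text[index] not in ".," or new_mark not in (".", ","):
        raise ValueError("only period/comma changes are sketched here")
    head = text[:index]
    tail = text[index + 1:].lstrip(" ")  # drop the existing spacing after the mark
    if not tail:
        return head + new_mark
    if new_mark == ".":
        # comma -> period: add the extra space and capitalize the next word
        return head + ".  " + tail[0].upper() + tail[1:]
    # period -> comma: remove the extra space and uncapitalize the next word
    return head + ", " + tail[0].lower() + tail[1:]
```

Both adjustments happen inside one call, which mirrors the idea that the punctuation change and the surrounding reformatting may be made in response to a single user action.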

[0019] These general and specific aspects may be implemented
using a system, a method, or a computer program,
or any combination of systems, methods, and
computer programs.

[0020] Other features and advantages will be appar- 
ent from the description and drawings, and from the 
claims. 

[0021] The present invention will be described, by 
way of example, with reference to the accompanying 
drawings, in which: 

Fig. 1 is a block diagram of a speech recognition
system;
Figs. 2 and 3 are block diagrams of speech recognition
software of the system of Fig. 1;
Fig. 4 is a representation of an algorithm for performing
automatic insertion of non-verbalized punctuation
using the system of Fig. 1;
Fig. 5 is a representation of data used in the algorithm
of Fig. 4;
Figs. 6 and 7 are flow charts of exemplary processes
for determining whether or not to insert non-verbalized
punctuation and, if so, which non-verbalized
punctuation to insert;
Figs. 8 and 9 are screen shots of a correction dialogue
used in the system of Fig. 1;
Fig. 10 is a flow chart of an exemplary process for
adjusting punctuation and spacing;
Fig. 11 is a flow chart of an exemplary process for
recognizing punctuation in computer-implemented
speech recognition; and
Fig. 12 is a flow chart of an exemplary process for
correcting incorrect text associated with recognition
errors in computer-implemented speech recognition.

[0022] Like reference symbols in the various drawings
may indicate like elements.

[0023] In traditional speech recognition systems, in
order to have punctuation marks, such as, for example,
commas, periods (full stops), and question marks, appear
in the recognized text, each punctuation mark must
be pronounced. However, in natural speech, punctuation
marks usually are not pronounced. Accordingly, a
speech recognition system may include a punctuation
system that automatically determines where to insert
punctuation marks in recognized text without requiring
the punctuation marks to be pronounced, and then adjusts
the recognized text based on the determination.
[0024] Referring to Fig. 1, a speech recognition system
100 includes input/output (I/O) devices (for example,
a microphone 102, a mouse 104, a keyboard 106
and a display 108) and a computer 110 having a central
processing unit (CPU) 112, an I/O unit 114, and a sound
card 116. A memory 118 stores data and programs such
as an operating system 120 (for example, DOS, Windows®,
Windows® 95, Windows® 98, Windows® 2000,
Windows® NT, Windows® Millennium Edition, Windows® XP,
OS/2®, Mac OS®, and Linux), an application
program 122, and speech recognition software 124.
Other examples of system 100 include a workstation, a
server, a device, a component, other equipment or some
combination thereof capable of responding to and executing
instructions in a defined manner.
[0025] Examples of application programs 122 include
authoring applications (for example, word processing
programs, database programs, spreadsheet programs,
presentation programs, electronic mail programs and
graphics programs) capable of generating documents
or other electronic content, browser applications (for example,
Netscape's Navigator and Microsoft's Internet
Explorer) capable of rendering standard Internet content,
personal information management (PIM) programs
(for example, Microsoft® Outlook®, Outlook® Express,
and Lotus Notes®) capable of managing personal information,
and other programs (for example, contact management
software, time management software, expense
reporting applications, and fax programs). Any of
the Dragon NaturallySpeaking® software versions,
available from ScanSoft, Inc. of Peabody, Massachusetts,
offer examples of suitable speech recognition
software 124.

[0026] The computer 110 may be used for speech recognition.
In this case, the microphone 102 receives the
user's speech and conveys the speech, in the form of
an analog signal, to the sound card 116, which in turn
passes the signal through an analog-to-digital (A/D)
converter to transform the analog signal into a set of digital
samples. Under control of the operating system 120
and the speech recognition software 124, the processor
112 identifies utterances in the user's speech. Utterances
are separated from one another by a pause having
a sufficiently large, predetermined duration (for example,
160-250 milliseconds). Each utterance may include
one or more words of the user's speech.
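
By way of illustration only, the separation of utterances by a pause of predetermined duration may be sketched as follows. The sketch assumes that some upstream detector (not described here) has already labeled each frame as speech or silence; the function name, the 10 millisecond frame length, and the 200 millisecond threshold (chosen from within the 160-250 millisecond range given above) are assumptions made for the example.

```python
from typing import List


def split_utterances(is_speech: List[bool], frame_ms: int = 10,
                     min_pause_ms: int = 200) -> List[range]:
    """Group frame indices into utterances separated by sufficiently long pauses."""
    min_pause_frames = min_pause_ms // frame_ms
    utterances: List[range] = []
    start = None        # first speech frame of the utterance being built
    last_speech = None  # most recent speech frame seen
    silent_run = 0      # consecutive silent frames since last_speech
    for i, speech in enumerate(is_speech):
        if speech:
            if start is None:
                start = i
            last_speech = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # the pause is long enough: close the current utterance
                utterances.append(range(start, last_speech + 1))
                start, silent_run = None, 0
    if start is not None:
        utterances.append(range(start, last_speech + 1))
    return utterances
```

A short pause (for example, three silent frames) leaves the surrounding speech in one utterance, whereas a pause at or above the threshold starts a new one.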
[0027] The system also may include an analog re- 
corder port 126 and/or a digital recorder port 128. The 
analog recorder port 126 is connected to the sound card
116 and is used to transmit speech recorded using an 
analog or digital hand-held recorder to the sound card. 
The analog recorder port 126 may be implemented us- 
ing a line-in port. The hand-held recorder is connected 
to the port using a cable connected between the line-in 
port and a line-out or speaker port of the recorder. The 
analog recorder port 126 may be implemented as a mi- 
crophone positioned so as to be next to the speaker of 
the hand-held recorder when the recorder is inserted in- 
to the port 126, and also may be implemented using the 
microphone 102. Alternatively, the analog recorder port 
126 may be implemented as a tape player that receives
a tape recorded using a hand-held recorder and trans- 
mits information recorded on the tape to the sound card 
116. 

[0028] The digital recorder port 128 may be imple- 
mented to transfer a digital file generated using a hand- 
held digital recorder 130. This file may be transferred 
directly into memory 118, or to a storage device such as
hard drive 132. The digital recorder port 128 may be implemented
as a storage device (for example, a floppy
drive or CD-ROM drive) of the computer 110, or as an 
I/O port (for example, a USB port). 
[0029] Fig. 2 illustrates components of the speech 
recognition software 124. For ease of discussion, the 
following description indicates that the components car- 
ry out operations to achieve specified results. However, 
it should be understood that each component typically 
causes the processor 112 to operate in the specified 
manner. The speech recognition software 124 typically 
includes one or more modules, such as a front end 
processing module 200, a recognizer 215, a control/interface
module 220, a constraint grammar module 225,
an active vocabulary module 230, an acoustic model 
module 235, a pre-filtering module 240, and a backup 
dictionary module 245. 

[0030] Initially, a front end processing module 200 
converts the digital samples 205 from the sound card 
116 (or from the digital recorder port 128) into frames of
parameters 210 that represent the frequency content of
an utterance. Each frame may include 24 parameters 
and represents a short portion (for example, 10 millisec- 
onds) of the utterance. 
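
By way of illustration only, the division of digital samples into 10 millisecond frames may be sketched as follows. The sketch covers only the slicing step; the computation of the 24 frequency-content parameters per frame is not shown because the text above does not specify it. The function name and the 8000 Hz sample rate used in the usage note are assumptions made for the example.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[int], sample_rate_hz: int,
                      frame_ms: int = 10) -> List[Sequence[int]]:
    """Slice samples into consecutive, non-overlapping frames of frame_ms each.

    Trailing samples that do not fill a whole frame are dropped.
    """
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    if samples_per_frame <= 0:
        raise ValueError("sample rate too low for the requested frame length")
    n_full = len(samples) // samples_per_frame
    return [samples[i * samples_per_frame:(i + 1) * samples_per_frame]
            for i in range(n_full)]
```

At an assumed 8000 Hz sample rate, each 10 millisecond frame holds 80 samples, so 250 samples yield three full frames.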

[0031] A recognizer 215 receives and processes the
frames of an utterance to identify text corresponding to
the utterance. The recognizer 215 entertains several hypotheses
about the text and associates a score with
each hypothesis. The score reflects the probability that
a hypothesis corresponds to the user's speech. For
ease of processing, scores may be maintained as negative
logarithmic values. Accordingly, a lower score indicates
a better match (a higher probability) while a higher
score indicates a less likely match (a lower probability),
with the likelihood of the match decreasing as the
score increases. After processing the utterance, the recognizer
215 provides the best-scoring hypotheses to the
control/interface module 220 as a list of recognition candidates,
where each recognition candidate corresponds
to a hypothesis and has an associated score. Some recognition
candidates may correspond to text while other
recognition candidates correspond to commands. Commands
may include words, phrases, or sentences.
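
By way of illustration only, the negative logarithmic scoring described above may be sketched as follows. The function names and the example probabilities are assumptions made for the example; the sketch shows only that a higher probability maps to a lower score and that the best-scoring hypotheses are those with the lowest scores.

```python
import math
from typing import Dict, List, Tuple


def to_score(probability: float) -> float:
    """Convert a probability into a negative-log score (lower is better)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)


def best_candidates(probabilities: Dict[str, float],
                    n: int) -> List[Tuple[str, float]]:
    """Return the n hypotheses with the lowest (best) scores."""
    scored = [(text, to_score(p)) for text, p in probabilities.items()]
    scored.sort(key=lambda pair: pair[1])  # ascending: best match first
    return scored[:n]
```

One practical reason for this representation is that multiplying probabilities becomes adding scores, which avoids numerical underflow over long utterances.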
[0032] The recognizer 215 processes the frames 210
of an utterance in view of one or more constraint grammars
225. A constraint grammar, also referred to as a
template or restriction rule, may be a limitation on the
words that may correspond to an utterance, a limitation
on the order or grammatical form of the words, or both.
For example, a constraint grammar for menu-manipulation
commands may include only entries from the menu
(for example, "file" or "edit") or command words for navigating
through the menu (for example, "up", "down",
"top" or "bottom"). Different constraint grammars may
be active at different times. For example, a constraint
grammar may be associated with a particular application
program 122 and may be activated when the user
opens the application program 122 and deactivated
when the user closes the application program 122. The
recognizer 215 may discard any hypothesis that does
not comply with an active constraint grammar. In addition,
the recognizer 215 may adjust the score of a hypothesis
associated with a particular constraint grammar
based on characteristics of the constraint grammar.
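
By way of illustration only, discarding hypotheses that do not comply with an active constraint grammar may be sketched as follows. The sketch models the menu-manipulation grammar as a plain set of allowed words, which is a simplifying assumption; a grammar that also limits word order or grammatical form would need a richer representation. All names are hypothetical.

```python
from typing import List, Set, Tuple

# Word lists taken from the menu-manipulation example in the text.
MENU_ENTRIES = {"file", "edit"}
NAVIGATION_WORDS = {"up", "down", "top", "bottom"}


def complies(hypothesis: str, allowed_words: Set[str]) -> bool:
    """True if every word of the hypothesis is permitted by the grammar."""
    words = hypothesis.lower().split()
    return bool(words) and all(word in allowed_words for word in words)


def filter_hypotheses(scored: List[Tuple[str, float]],
                      allowed_words: Set[str]) -> List[Tuple[str, float]]:
    """Discard hypotheses that do not comply with the active grammar."""
    return [(text, score) for text, score in scored
            if complies(text, allowed_words)]
```

In the usage below, "fill" is dropped even though it has the best score, because it is not a word the grammar allows.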
[0033] Another constraint grammar 225 that may be
used by the speech recognition software 124 is a large
vocabulary dictation grammar. The large vocabulary
dictation grammar identifies words included in the active
vocabulary 230, which is the vocabulary of words known
to the software. The large vocabulary dictation grammar
also includes a language model that indicates the frequency
with which words occur.
[0034] Other examples of constraint grammars 225
include an in-line dictation macros grammar for dictation
commands, such as "CAP" or "Capitalize" to capitalize
a word and "New-Paragraph" to start a new paragraph;
a text range selection grammar used in selecting text;
an error correction commands grammar; a dictation editing
grammar; an application command and control
grammar that may be used to control a particular application
program 122; a global command and control
grammar that may be used to control the operating system
120 and the speech recognition software 124; a
menu and dialog tracking grammar that may be used to
manipulate menus and dialog; and a keyboard control
grammar that permits the use of speech in place of input
devices, such as the keyboard 106 or the mouse 104.
A large vocabulary dictation grammar may include multiple
dictation topics (for example, "medical" or "legal"),
each having its own vocabulary file and its own language
model. A dictation topic includes a set of words
that represents the active vocabulary 230. In a typical
example, a topic may include approximately 30,000
words that are considered for normal recognition.
[0035] A complete dictation vocabulary consists of the 
active vocabulary 230 plus a backup vocabulary 245. 
The backup vocabulary 245 may include files that con- 
tain user-specific backup vocabulary words and system- 
wide backup vocabulary words. 
[0036] User-specific backup vocabulary words in- 
clude words that a user has created while using the 
speech recognition software. These words are stored in 
vocabulary files for the user and for the dictation topic, 
and are available as part of the backup dictionary for the 
dictation topic regardless of user, and to the user regardless
of which dictation topic is being used. For example,
if a user is using a medical topic and adds the word "gan- 
glion" to the dictation vocabulary, any other user of the 
medical topic will have immediate access to the word 
"ganglion". In addition, the word will be written into the 
user-specific backup vocabulary. Then, if the user says 
"ganglion" while using a legal topic, the word "ganglion" 
will be available during correction from the backup dic- 
tionary. 

[0037] In addition to the user-specific backup vocab- 
ulary noted above, there is a system-wide backup vo- 
cabulary. The system-wide backup vocabulary contains 
all the words known to the system, including words that 
may currently be in an active vocabulary. 
[0038] The control/interface module 220 controls op- 
eration of the speech recognition software and provides 
an interface to other software or to the user. The control/ 
interface module 220 receives the list of recognition can- 
didates for each utterance from the recognizer 215. 
Recognition candidates may correspond to dictated 
text, speech recognition commands, or external com- 
mands. When the best-scoring recognition candidate 
corresponds to dictated text, the control/interface mod- 
ule 220 provides the text to an active application, such 
as a word processor. The control/interface module 220 
also may display the best-scoring recognition candidate 
to the user through a graphical user interface. When the 
best-scoring recognition candidate is a command, the 
control/interface module 220 implements the command. 
For example, the control/interface module 220 may control
operation of the speech recognition software 124 in
response to speech recognition commands (for example,
"wake up" or "make that"), and may forward external
commands to the appropriate software. 
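
By way of illustration only, the routing of the best-scoring recognition candidate may be sketched as follows. The `Candidate` class, the `kind` labels, and the three callback parameters are assumptions made for the example; they stand in for the active application, the speech recognition software, and other software, respectively.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str
    score: float  # negative log probability: lower is better
    kind: str     # "text", "speech_command", or "external_command" (hypothetical labels)


def dispatch(candidates: List[Candidate],
             send_text: Callable[[str], None],
             run_command: Callable[[str], None],
             forward_external: Callable[[str], None]) -> Candidate:
    """Route the best-scoring candidate to the matching handler."""
    best = min(candidates, key=lambda c: c.score)
    if best.kind == "text":
        send_text(best.text)         # dictated text goes to the active application
    elif best.kind == "speech_command":
        run_command(best.text)       # e.g. "wake up" or "make that"
    else:
        forward_external(best.text)  # hand off to the appropriate software
    return best
```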
[0039] The control/interface module 220 also may 
control the active vocabulary 230, acoustic models 235, 
and constraint grammars 225 that are used by the recognizer
215. For example, when the speech recognition
software 124 is being used in conjunction with a particular
application program 122 (for example, Microsoft
Word), the control/interface module 220 updates the active
vocabulary 230 to include command words associated
with that application and activates constraint grammars
225 associated with the application program 122.
[0040] Other functions provided by the control/interface
module 220 include an enrollment program, a vocabulary
customizer, and a vocabulary manager. The
enrollment program collects acoustic information from
a user and trains or adapts a user's models based on
that information. The vocabulary customizer optimizes
the language model of a specific topic. The vocabulary
manager is a tool that is used by developers to browse
and manipulate vocabularies, grammars, and macros.
Each function of the control/interface module 220 may
be implemented as an executable program that is separate
from the main speech recognition software.

[0041] The control/interface module 220 also may implement
error correction and cursor/position manipulation
procedures of the software 124. Error correction
procedures include, for example, a "make that" command
and a "spell that" command. Cursor/position manipulation
procedures include the "select" command
discussed above and variations thereof (for example,
"select [start] through [end]"), "insert before/after" commands,
and a "resume with" command.
[0042] The control/interface module 220 may implement
error correction procedures of the speech recognition
software 124. When the speech recognition system
100 makes a recognition error, the user may invoke
an appropriate correction command to remedy the error.
During error correction, word searches of the backup
dictionary 245 start with the user-specific backup dictionary
and then check the system-wide backup dictionary.
The backup dictionary 245 also is searched when
there are new words in text that a user has typed.
[0043] In general, the backup dictionary 245 includes
substantially more words than are included in the active
vocabulary 230. For example, when the active vocabulary
230 has 60,000 or so entries, the backup dictionary
245 may have roughly 190,000 entries. The active vocabulary
230 is a dynamic vocabulary in that entries may
be added or subtracted from the active vocabulary over
time. For example, when the user indicates that an error
has been made and the control/interface module 220
uses the backup dictionary 245 to correct the error, a
new word from the backup dictionary 245 may be added
to the active vocabulary 230 to reduce the likelihood that
the error will be repeated.
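
By way of illustration only, the backup dictionary search order of paragraph [0042] and the promotion of a found word into the active vocabulary may be sketched as follows. Modeling each vocabulary as a Python set, and the function name itself, are assumptions made for the example.

```python
from typing import Optional, Set


def correct_with_backup(word: str, active: Set[str],
                        user_backup: Set[str],
                        system_backup: Set[str]) -> Optional[str]:
    """Look a word up in the backup dictionaries during error correction.

    The user-specific backup is searched first, then the system-wide
    backup. A word that is found is added to the active vocabulary so
    that the same error is less likely to be repeated. Returns which
    dictionary supplied the word, or None if neither contains it.
    """
    for source, vocabulary in (("user", user_backup), ("system", system_backup)):
        if word in vocabulary:
            active.add(word)
            return source
    return None
```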

[0044] In one implementation, one or more language
models may be employed by the recognizer. In determining
the acoustic models that best match an utterance,
the processor may consult a language model that
indicates a likelihood that the text corresponding to the
acoustic model occurs in speech. For example, one language
model may include a bigram model that indicates
the frequency with which a word occurs in the context
of a preceding word. For instance, a bigram model may
indicate that a noun or an adjective such as "word" is
more likely to follow the word "the" than a verb such as
"is."
[0045] The language model may be generated from a
large sample of text. In general, probabilities produced
by the language model do not change during use. However,
the language model may change as words are
added to or subtracted from the language model as the
words are added to or subtracted from the active vocabulary.

A language model associated with the large vocabulary
dictation grammar may be a unigram model that indicates
the frequency with which a word occurs independently
of context, or a bigram model that indicates the
frequency with which a word occurs in the context of a
preceding word. For example, a bigram model may indicate
that a noun or adjective is more likely to follow
the word "the" than is a verb. The language model also
may be a trigram model that indicates the frequency with
which a word occurs in the context of two preceding words,
or some other variation.
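
By way of illustration only, unigram, bigram, and trigram models of the kind described above may be sketched as relative-frequency counts over a text sample. The function names and the three-sentence sample in the usage note are assumptions made for the example; a practical model would be built from a large sample and would typically smooth the counts, which this sketch does not do.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def train_ngram(sentences: Iterable[str], n: int) -> Dict[Tuple[str, ...], Counter]:
    """Count how often each word follows each context of n - 1 preceding words."""
    counts: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(n - 1, len(words)):
            context = tuple(words[i - n + 1:i])  # empty tuple when n == 1
            counts[context][words[i]] += 1
    return counts


def ngram_prob(counts: Dict[Tuple[str, ...], Counter],
               context: Sequence[str], word: str) -> float:
    """Relative frequency of word given the context (0.0 if the context is unseen)."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())
```

With the sample "the word is short", "the word is long", "the cat is here", the bigram counts give "word" a frequency of 2/3 after "the" and give "is" a frequency of zero after "the", in line with the noun-versus-verb example above.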

[0046] Another exemplary language model is a cate- 
gory language model that indicates the frequency with 
which a word occurs in the context of a preceding cate- 
gory. For example, a simple category model may include 
categories such as "nouns" or "adjectives." For in- 
stance, such a simple category model may indicate that 
the word "is" is more likely to follow words from the 
"nouns" category than words from the "adjectives" cat- 
egory. More complex category models may include cat- 
egories such as "places," "sports adjectives," or "medi- 
cal nouns." As with the word bigram model, the category 
model may be generated from a large sample of data 
and may include probabilities that do not change during 
use. 
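
By way of illustration only, a simple category language model may be sketched as follows. The word-to-category table and the function names are assumptions made for the example; the sketch counts how often each word follows a word belonging to a given category, and ignores preceding words that have no category.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable

# Hypothetical word-to-category table for the example.
CATEGORY_OF = {"dog": "nouns", "cat": "nouns",
               "big": "adjectives", "red": "adjectives"}


def train_category_model(sentences: Iterable[str],
                         category_of: Dict[str, str]) -> Dict[str, Counter]:
    """Count how often each word follows a word of each category."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            category = category_of.get(previous)
            if category is not None:
                counts[category][current] += 1
    return counts


def category_prob(counts: Dict[str, Counter], category: str, word: str) -> float:
    """Relative frequency of word after any word of the given category."""
    followers = counts.get(category)
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())
```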

[0047] Other exemplary language models may in- 
clude a unigram topic language model, a bigram topic 
language model, and a trigram topic language model, 
each of which may be based on a source of text asso- 
ciated with a user. In one implementation, the topic lan- 
guage model may include a single language model as- 
sociated with a particular user that contains unigram, bi- 
gram, and trigram information. 
[0048] The various language models discussed 
above may be included in a single language model or 
may be divided into one or more associated language 
models. Each user of the speech recognition system 
may have one or more language models. 
[0049] Referring to Fig. 3, in another implementation, 
the speech recognition software 124 includes a recog- 
nizer 300 that, like the recognizer 215, receives and 
processes frames of an utterance to identify text corresponding
to the utterance. The software 124 includes
an automatic punctuation server ("AP Server") 302 that 
processes output from the recognizer 300. The recog- 
nizer 300 outputs result objects that describe the results 
of the recognition of the utterance or part of the utter- 
ance. 

[0050] A result object includes a set of information that
the recognizer 300 acquires when a user speaks into
the microphone 102. For example, if the user speaks
"hello world," the result object contains a block of audio
data that, when played back, would recite the user's
speech "hello world," along with the times the speech
started and ended. The result object also contains the
word "hello," the time the word started and ended, and
other information about the word "hello." Likewise, the
result object contains the same information relating to
the word "world." The result objects that are output by
the recognizer 300 also include a list of recognition candidates,
where each recognition candidate corresponds
to a hypothesis and has an associated score. The result
object also contains a list of alternative recognition candidates,
such as "fellow world" and "hello wood," with
similar sets of information about each alternative. In
summary, a result object contains the information that
the recognizer 300 knows or determines about what a
user has just spoken. For one document, there are many
result objects. The result objects may be stored within
buffers (not shown) and may include acoustic data. The
AP server 302 receives the acoustic data from the buffers
associated with the recognizer 300.
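
By way of illustration only, the information held by a result object may be sketched as a small set of data classes. The class and field names are assumptions made for the example and do not reflect the actual layout used by the recognizer 300.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordInfo:
    text: str
    start_ms: int  # time the word started
    end_ms: int    # time the word ended


@dataclass
class Candidate:
    words: List[WordInfo]
    score: float   # negative log probability: lower is better

    def text(self) -> str:
        return " ".join(word.text for word in self.words)


@dataclass
class ResultObject:
    audio: bytes   # block of audio data for playback
    start_ms: int  # time the speech started
    end_ms: int    # time the speech ended
    candidates: List[Candidate] = field(default_factory=list)

    def best(self) -> Candidate:
        """Return the best-scoring recognition candidate."""
        return min(self.candidates, key=lambda c: c.score)
```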

[0051] The AP server 302 interacts with a speech 
models database 303, which contains one or more 
acoustic and/or language models that may be associated
with particular users. The AP server 302 may process
requests from a particular user to access and load an 
acoustic and/or language model associated with that us- 
er from the speech models database 303. 
[0052] The AP server 302 and the recognizer 300
communicate with a dictation module 304 through a
communication module 306. The communication module
306 is an application program interface that enables
other programs to call the recognizer 300. Thus, the
communication module 306 includes software that defines
a standard way for other software to call the recognizer
300. Many of the calls through the communication
module 306 are directed to the AP server 302 or to
the dictation module 304. The communication module
306 may interact with a database 310 that includes natural-language
grammars, a database 311 that includes
simple grammars, and compatibility modules 312, and may be
used by external developers who need to access the
communication module 306 through a software layer
314.

[0053] A voice command layer module 313 may receive
voice commands from a user and interact with the
simple grammars database 311 to process those commands.
The voice command layer module 313 also may
interact with a custom command database 317 to enable
users to add custom commands to, and retrieve custom
commands from, the custom command database
317. A natural language module 315 may receive more
complex voice commands from a user and interact with
the natural-language grammars database 310 to process
those more complex commands.

[0054] Tools module 319 includes additional components
of the speech recognition software 124, such as,
for example, executables and software to enable enrollment,
vocabulary enhancement, and testing.
[0055] The dictation module 304 maintains a copy of 
the document that is being dictated, carries out request- 
ed actions like audio playback, capitalization, and cor- 
rection, and stores the correlation between speech 
sounds and written characters. Output from the dictation 
module 304 is sent to applications such as a text editing 
application 320 (e.g., Dragon Pad) that is able to be 
used with dictation; an application process map 322 that 
provides a continually-updated list of which other appli- 
cations are running; and a user interface 326 (e.g., 
Dragon Bar) that provides menu items for actions that 
can be performed by the system 100. Additionally, the
AP server 302 interacts with the user interface engine 
328, which is the basic interface software for the recog- 
nizer 300. 

[0056] The AP server 302 uses a model that predicts, 
at every word gap, whether there is punctuation or no 
punctuation. In one implementation, the model includes 
a logistic regression model. If the AP server 302 deter- 
mines that the word gap includes a punctuation event, 
the AP server 302 determines the kind of punctuation 
mark to which the event corresponds. A word gap is the 
time between the end of one word and the beginning of 
the next consecutive word. The model uses words be- 
fore and after the gap (called text features) that may be 
accessed from the recognizer 300. The dictation module 
304 also maintains information about the order among 
the text blocks. The model also may use acoustic fea- 
tures such as the length of silence following a current 
gap, a function of pitch (e.g., the average pitch of the 
word two back from a current gap), and a ratio of the 
average pitches of words one forward and one back 
from the current gap. Acoustic features may be ac- 
cessed from the recognizer 300. 
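
The acoustic features named above can be illustrated with a minimal Python sketch. The `Word` record, its field names, and the `acoustic_features` helper are hypothetical; the description does not say how the recognizer 300 exposes timing or pitch data, so this is only one possible arrangement.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    text: str
    start: float      # seconds, start of the word (assumed field)
    end: float        # seconds, end of the word (assumed field)
    avg_pitch: float  # average fundamental frequency in Hz (assumed field)


def acoustic_features(words: List[Word], gap: int) -> Dict[str, float]:
    """Features for the gap between words[gap] and words[gap + 1]."""
    if not 0 <= gap < len(words) - 1:
        raise ValueError("gap must lie between two words")
    back, forward = words[gap], words[gap + 1]
    feats: Dict[str, float] = {
        # Word gap: time between the end of one word and the start of the next.
        "silence": forward.start - back.end,
    }
    if back.avg_pitch > 0:
        # Ratio of the average pitches of the words one forward and one back.
        feats["pitch_ratio"] = forward.avg_pitch / back.avg_pitch
    if gap >= 1:
        # Average pitch of the word two back from the gap.
        feats["pitch_two_back"] = words[gap - 1].avg_pitch
    return feats
```

Features that cannot be computed (for example, no word two back at the start of an utterance) are simply omitted in this sketch; a real system would need some policy for missing values.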
[0057] Referring also to Fig. 4, the AP server 302 re- 
ceives the results objects 405 from the recognizer 300 
and forms wrapped results object 410 that includes 
choices 415. A wrapped results object 410 requires a 
calling program to use functions to access the data rath- 
er than letting the calling program access the data di- 
rectly. Thus, there is a function to access the audio data, 
a function to access the time data, and a function to ac- 
cess the Nth alternative in the list. 
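
As an illustration only, a wrapped results object of this kind might look like the following Python sketch. The class name, method names, and data types are assumptions; the description states only that access goes through functions for the audio data, the time data, and the Nth alternative.

```python
from typing import List, Sequence, Tuple


class WrappedResults:
    """Hides recognizer data behind accessor functions (hypothetical API)."""

    def __init__(self, audio: bytes,
                 times: Sequence[Tuple[float, float]],
                 choices: Sequence[str]) -> None:
        self._audio = audio
        self._times = list(times)
        self._choices = list(choices)

    def get_audio(self) -> bytes:
        """Function to access the audio data."""
        return self._audio

    def get_times(self) -> List[Tuple[float, float]]:
        """Function to access the time data (a copy, so callers cannot
        modify the stored list)."""
        return list(self._times)

    def get_choice(self, n: int) -> str:
        """Function to access the Nth alternative (0 is the top choice)."""
        return self._choices[n]
```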
[0058] After the AP server 302 completes processing,
including autopunctuation, the output of the server is a
sequence of recognized tokens that are passed to the 
dictation module 304 through the communication mod- 
ule 306. The dictation module 304 performs formatting 
functions on the outputted text by controlling or altering 
capitalization and spacing relative to inserted punctua- 
tion marks. Thus, if a period is inserted by the AP server 
302, whether from autopunctuation or from verbalized 
punctuation, the dictation module 304 inserts an extra 
space after the period and then capitalizes the next word 
following the period. The AP server 302 uses both lan- 
guage model and acoustic content to the left and to the 
right of the gap 420 or the potential insertion point. 



[0059] Referring also to Fig. 5, a block diagram illustrates
three utterances 505, 510, and 515. Each utterance includes a
text block 520a, 520b, and 520c, a wrapped results object
410a, 410b, and 410c having one or more choices 415a, 415b,
and 415c, and recognizer result objects 405a, 405b, and 405c.
When information for a particular insertion point is not all
within a current utterance 505, the information from a
previous utterance 510 and next utterance 515 also may be
needed as defined by the dictation module 304 and then
transmitted to the AP server 302. For example, the wrapped
results object 410b may include a pointer 525 to a previous
wrapped results object 410a that is part of the previous
utterance 510. Wrapped results object 410b also may include a
pointer 535 to the next wrapped results object 410c that is
part of the next utterance 515. In this manner, both language
model and acoustic content from utterances surrounding a
current utterance 505 may be used.
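
The previous/next pointers described above can be sketched as a doubly linked list of per-utterance results. The following Python sketch is an assumption about one way to gather words on both sides of a gap across utterance boundaries; `UtteranceResult`, `link`, and `context` are hypothetical names and not part of the described system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class UtteranceResult:
    words: List[str]
    prev: Optional["UtteranceResult"] = field(default=None, repr=False)
    next: Optional["UtteranceResult"] = field(default=None, repr=False)


def link(results: List[UtteranceResult]) -> None:
    """Set the previous/next pointers between consecutive utterances."""
    for a, b in zip(results, results[1:]):
        a.next, b.prev = b, a


def context(result: UtteranceResult, index: int,
            width: int = 2) -> Tuple[List[str], List[str]]:
    """Up to `width` words left and right of the gap after words[index],
    following the pointers when the current utterance runs out."""
    left: List[str] = []
    node, i = result, index
    while node is not None and len(left) < width:
        while i >= 0 and len(left) < width:
            left.append(node.words[i])
            i -= 1
        node = node.prev
        if node is not None:
            i = len(node.words) - 1
    left.reverse()

    right: List[str] = []
    node, i = result, index + 1
    while node is not None and len(right) < width:
        while i < len(node.words) and len(right) < width:
            right.append(node.words[i])
            i += 1
        node = node.next
        i = 0
    return left, right
```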
[0060] Often, at the end of an utterance, there is
considerable information in the words after a potential
insertion point. In this case, the AP server 302 performs the
modeling on the last word of the utterance once the next
utterance is received and recognized by the recognizer 300.
Thus, for example, if a user speaks a first utterance:

"Here is some unpunctuated text"

then pauses, and then speaks a second utterance:

"on another topic"

the engine UI 328 outputs:

"Here is some unpunctuated text".

[0061] After the second utterance, punctuation is inserted at
the end of the first utterance (and capitalization is
adjusted) and the engine UI 328 now outputs:

"Here is some unpunctuated text. On another topic"
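
The deferred decision at an utterance boundary can be sketched as follows. This is a minimal illustration, not the described implementation: `DeferredPunctuator` is a hypothetical class, and the `predict` callback is a stand-in for the model run by the AP server 302. A single space is used after the period to match the example output above.

```python
from typing import Callable, List, Optional

# Hypothetical predictor: given the word before and the word after a
# boundary, returns ".", ",", or None for no punctuation.
Predictor = Callable[[str, str], Optional[str]]


class DeferredPunctuator:
    def __init__(self, predict: Predictor) -> None:
        self._predict = predict
        self._utterances: List[List[str]] = []
        self._marks: List[Optional[str]] = []  # mark after each utterance

    def add_utterance(self, text: str) -> str:
        words = text.split()
        if not words:
            return self.render()
        if self._utterances:
            # The boundary after the previous utterance is scored only now,
            # when the words to its right are known.
            mark = self._predict(self._utterances[-1][-1], words[0])
            self._marks[-1] = mark
            if mark == ".":
                words[0] = words[0][:1].upper() + words[0][1:]
        self._utterances.append(words)
        self._marks.append(None)
        return self.render()

    def render(self) -> str:
        parts = [" ".join(words) + (mark or "")
                 for words, mark in zip(self._utterances, self._marks)]
        return " ".join(parts)
```

With a toy predictor that places a period before the word "on", the two utterances of the example produce first the unpunctuated text and then the punctuated, recapitalized text.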

[0062] There are times during dictation when a user inserts
text by typing. In this case, the text buffer for the
dictation module 304 has no corresponding result objects from
the recognizer 300 and thus no acoustic data is available.

[0063] A user may be able to turn off the automatic 
punctuation features of the AP server 302. 
[0064] If a user selects a single word within a middle text
block, a correction dialogue window opens and the AP server
302 adds adjacent punctuation to the selected text if needed.
If the user selects a word at the edges of a text block, then
the AP server 302 may not add punctuation to the selected
text. Using the example above, if the AP server 302 did not
run the punctuation model, the output would be:

"Here is some unpunctuated text| on another topic",

where | indicates an utterance boundary or a text block
boundary, which acts as a potential insertion point. Once the
AP server 302 runs the model for punctuation, the output is:

"Here is some unpunctuated text|. On another topic".

[0065] After the user selects the word "On," the output to
the UI is:

"Here is some unpunctuated text|. On another topic".

[0066] On the other hand, after selection of the word "text,"
the output to the UI is:

"Here is some unpunctuated text|. On another topic".

[0067] Note that the period is not selected if the word
"text" is selected.

[0068] Referring to Fig. 6, an exemplary process 600 is
illustrated to determine whether or not to insert
non-verbalized punctuation, and if so, what type of
non-verbalized punctuation. Acoustic features 605, such as,
for example, word gap or average fundamental frequency, are
extracted from the wrapped results object 410a. Similarly,
language model features 615 are extracted from the language
model 620 using the words extracted from the wrapped results
object 410b. The language model features 615 may include the
probability of a particular trigram. Both the acoustic
features 605 and the language model features 615 are provided
as input to a punctuation/no punctuation model classifier 625
and a period/comma model classifier 630.
[0069] Referring to Fig. 7, the punctuation/no punctuation
model classifier 625 is used to estimate the probability that
there is punctuation at a particular space between words. The
period/comma model classifier 630 estimates the probability
that, if there is punctuation, the punctuation is a period.
One exemplary type of classifier includes a logistic
regression model, in which the inputted features are combined
in a linearly weighted model. The output is passed through a
nonlinearity function to force the outcome to a probability
(i.e., between 0 and 1). One logistic function that may be
used includes:



\[ \mathrm{logistic}(x) = \frac{1}{1 + e^{-x}} \]

[0070] For example, the punctuation/no punctuation model
classifier 625 estimates the probability that there is
non-verbalized punctuation at a particular space between
words (Pr(punctuation)). A weighted sum of feature values is
fed through a nonlinear function to produce a probability
estimate, Pr(punctuation). If it is determined (705) that
Pr(punctuation) is less than a threshold level, T, then no
punctuation is output. If it is determined (705) that
Pr(punctuation) is greater than the threshold level, T, then
non-verbalized punctuation is inserted at the word gap based
on the outcome of the period/comma model classifier 630. The
threshold level, T, may be configurable. In one
implementation, if Pr(punctuation) = T, then non-verbalized
punctuation is inserted. In another implementation, if
Pr(punctuation) = T, then the non-verbalized punctuation is
not inserted.
[0071] The period/comma model classifier 630 provides a
probability that the non-verbalized punctuation is a period,
Pr(period). In other implementations, other punctuation type
classifiers similar to classifier 630 may be used. If it is
determined (710) that Pr(period) is greater than 0.5, then a
non-verbalized period is output. If it is determined (710)
that Pr(period) is less than 0.5, then a comma is output. In
one implementation, if Pr(period) = 0.5, then a period is
output. In another implementation, if Pr(period) = 0.5, then
a comma is output. Probability thresholds other than 0.5 may
be used.
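
The two-stage decision of steps 705 and 710 can be sketched in Python as below. The weights, bias terms, and function names are illustrative assumptions, since the description gives no trained values. This sketch resolves both tie cases in favor of inserting punctuation and of the period, which is one of the two options the text allows for each.

```python
import math
from typing import Optional, Sequence


def logistic(x: float) -> float:
    """Numerically stable logistic function, 1 / (1 + e^-x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def weighted_probability(features: Sequence[float],
                         weights: Sequence[float], bias: float) -> float:
    """Linearly weighted sum of features passed through the logistic."""
    return logistic(bias + sum(w * f for w, f in zip(weights, features)))


def decide(features: Sequence[float],
           punct_weights: Sequence[float], punct_bias: float,
           period_weights: Sequence[float], period_bias: float,
           threshold: float = 0.5) -> Optional[str]:
    """Return ".", ",", or None for the gap described by `features`."""
    # Step 705: punctuation versus no punctuation, threshold T.
    p_punct = weighted_probability(features, punct_weights, punct_bias)
    if p_punct < threshold:
        return None
    # Step 710: period versus comma.
    p_period = weighted_probability(features, period_weights, period_bias)
    return "." if p_period >= 0.5 else ","
```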

[0072] Referring to Fig. 8, a graphical user interface 
that includes a correction dialogue 800 may be present- 
ed to a user. The correction dialogue 800 may include 
a choice list 805, which includes the recognition alter- 
natives for the selection, augmented with punctuation 
choices. The correction dialogue 800 includes a window
to display a selected recognition result including the
non-verbalized punctuation associated with the utterance.
The choice list 805 includes a list of recognition
alternatives with at least one of the recognition
alternatives including a change to the non-verbalized
punctuation and associated adjustments in spacing and other
punctuation. For example, as shown in Fig. 8, the cor- 
rection dialogue 800 includes an utterance 810 that 
states "This is a test. This is not a real test". A portion 
of the utterance 810 is highlighted to include the period 
and the word "This" following the period. The correction 
dialogue 800 includes a list of choices 805 to replace 
the highlighted text from the utterance 810. The choice 
list 805 includes at least one correction 815 that chang- 
es the non-verbalized punctuation to include a punctu- 
ation choice and the correct spacing. 
[0073] Similarly, Fig. 9 illustrates a graphical user in- 
terface that includes a correction dialogue 900 which 
presents the user with a choice list 905. The correction 
dialogue 900 includes a current utterance 910. In this 
instance, the current utterance 910 is highlighted and a 
list of recognition alternatives is displayed with at least 
one of the recognition alternatives including a change 
to the non-verbalized punctuation and associated
adjustments in spacing and other punctuation. For
example, the first recognition alternative 915 in choice list
905 is highlighted and offers the user a different
recognition alternative that includes changing the period
after the word "not" to a comma and changing the spacing
appropriately.
[0074] In one implementation, the rules for punctuation
choices may include offering to change the inserted
punctuation wherever autopunctuation was inserted, by 
changing a period to a comma, changing a comma to a 
period, deleting a period, or deleting a comma, and, at 
the start of an utterance, offering to insert a period, but 
not a comma, at the end of the previous utterance. 
[0075] Referring to Fig. 10, a process 1000 illustrates that
punctuation and spacing may be adjusted by the dictation
module 304 for each proposed change in punctuation. If a
comma is changed to a period, or if a period is inserted
(step 1005), two spaces are inserted, and the next word is
capitalized (step 1010). If a period is changed to a comma
(step 1015), then the system determines whether the tokenizer
indicates that the next word is not a word in its
uncapitalized form, or is much less likely in that form than
in the capitalized form (step 1020). If it is appropriate to
decapitalize the next word, then the system decapitalizes the
next word and removes a space (step 1025). If it is not
appropriate to decapitalize the next word (step 1020), then a
space is removed (step 1030). For example, if a user removes
a period before the name "England," the module 304 does not
decapitalize, because the dictionary does not include a word
"england." If a user removes a period before the word "I," or
before the name "John," the module 304 does not decapitalize,
even though "i" and "john" are words in the dictionary,
because they are much less frequent than their capitalized
forms.
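
The spacing and capitalization rules of process 1000 can be sketched as follows. The word-count table, the ratio of 10 used to decide that a form is "much less frequent," and the function names are all assumptions made for illustration; the description does not say how the tokenizer makes this judgment.

```python
from typing import Dict


def should_decapitalize(word: str, counts: Dict[str, int],
                        ratio: float = 10.0) -> bool:
    """Step 1020: decapitalize only if the lowercase form exists and is
    not much rarer than the capitalized form (ratio is an assumption)."""
    lower = word[:1].lower() + word[1:]
    lower_count = counts.get(lower, 0)
    if lower_count == 0:
        return False  # e.g. "england" is not in the dictionary
    return counts.get(word, 0) < ratio * lower_count


def period_to_comma(left: str, next_word: str,
                    counts: Dict[str, int]) -> str:
    """Steps 1015 to 1030: one space after the comma instead of two,
    and the next word decapitalized when appropriate."""
    if should_decapitalize(next_word, counts):
        next_word = next_word[:1].lower() + next_word[1:]
    return left + ", " + next_word


def comma_to_period(left: str, next_word: str) -> str:
    """Steps 1005 and 1010: two spaces after the period and the next
    word capitalized."""
    return left + ".  " + next_word[:1].upper() + next_word[1:]
```

With made-up counts in which "this" is common, "england" is absent, and "i" and "john" are rare, the sketch decapitalizes "This" but leaves "England," "I," and "John" capitalized, matching the examples in the paragraph above.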

[0076] If the user selects some text and retypes, no 
autopunctuation is done on the retyped text. 
[0077] If the user corrects some text directly in the text 
buffer or by respeaking the text, without bringing up the 
correction dialog, the AP server 302, in certain circum- 
stances, adjusts text in the corresponding text buffer. 
[0078] The application editor, through control of the 
AP server 302 and the dictation module 304, is able to 
adjust spacing and capitalization upon insertion or de- 
letion of punctuation and/or text through key strokes; up- 
on deletion by a natural language grammar command 
(for example, "delete next word" or "delete next line"); 
upon deletion using a dictation command (for example, 
"delete that"); and upon insertion by speaking "period." 
[0079] Thus, a user is able to correct text and format- 
ting with the use of a single action. For example, if the 
user removes a period (using a key stroke or using a 
dictation command), then the AP server 302 and dicta- 
tion module 304 additionally remove a space and re- 
move capitalization as needed. The actions of removing 
the space and removing the capitalization occur when 
the user performs only the single action of deleting a 
period. When the user acts to edit punctuation, the AP 
server 302 invokes the dictation module 304 to re-format 
the area surrounding the edited punctuation. 
[0080] Referring to Fig. 11, a process 1100 for recognizing
punctuation in computer-implemented speech recognition
includes performing speech recognition on an utterance to
produce a speech recognition result for the utterance (step
1110). A non-verbalized punctuation mark is identified in a
recognition result (step 1120) and the recognition result is
formatted based on the identification (step 1130).
[0081] Referring to Fig. 12, a process 1200 for correcting
incorrect text associated with recognition errors in
computer-implemented speech recognition includes performing
speech recognition on an utterance to produce a speech
recognition result for the utterance (step 1210). A portion
of the recognition result that includes the non-verbalized
punctuation may be selected for correction (step 1220). The
portion of the recognition result that includes the
non-verbalized punctuation may be corrected with one of a
number of correction choices (step 1230).

[0082] The described systems, methods, and techniques
may be implemented in digital electronic circuitry
and/or analog circuitry, computer hardware, firmware, 
software, or in combinations of these elements. Appa- 
ratus embodying these techniques may include appro- 
priate input and output devices, a computer processor 
and a computer program product tangibly embodied in 
a machine-readable storage device for execution by a 
programmable processor. A process embodying these 
techniques may be performed by a programmable proc- 
essor executing a program of instructions to perform de- 
sired functions by operating on input data and generat- 
ing appropriate output. The techniques may be imple- 
mented in one or more computer programs that are ex- 
ecutable on a programmable system including at least 
one programmable processor coupled to receive data 
and instructions from, and to transmit data and instruc- 
tions to, a data storage system, at least one input device, 
and at least one output device. Each computer program 
may be implemented in a high-level procedural or ob- 
ject-oriented programming language, or in assembly or 
machine language if desired; and in any case, the lan- 
guage may be a compiled or interpreted language. Suit- 
able processors include, by way of example, both gen- 
eral and special purpose microprocessors. Generally, a 
processor will receive instructions and data from a read- 
only memory and/or a random access memory. Storage 
devices suitable for tangibly embodying computer pro- 
gram instructions and data include all forms of non-vol- 
atile memory, including by way of example semiconduc- 
tor memory devices, such as Erasable Programmable 
Read-Only Memory (EPROM), Electrically Erasable 
Programmable Read-Only Memory (EEPROM), and 
flash memory devices; magnetic disks such as internal 
hard disks and removable disks; magneto-optical disks; 
Digital Video Disc Read-Only Memory (DVD-ROM); and 
Compact Disc Read-Only Memory (CD-ROM). Any of 
the foregoing may be supplemented by, or incorporated 
in, specially-designed ASICs (application-specific inte- 
grated circuits). 

[0083] It will be understood that various modifications 
may be made without departing from the spirit and 
scope of the claims. For example, advantageous results
still could be achieved if steps of the disclosed
techniques were performed in a different order and/or if
components in the disclosed systems were combined in a
different manner and/or replaced or supplemented by
other components. Accordingly, other implementations
are within the scope of the following claims. For
example, other types of non-verbalized punctuation may be
automatically inserted and formatted in speech recognition
including, but not limited to, question marks,
exclamation points, quotations, apostrophes, colons,
semicolons, and hyphens. Also, the different types of
non-verbalized punctuation may be automatically inserted and
formatted according to the punctuation and grammar 
rules of languages other than English. 



Claims 

1. A method of recognizing punctuation in
computer-implemented speech recognition, the method
comprising:

performing speech recognition on an utterance
to produce a recognition result for the utterance;

identifying a non-verbalized punctuation mark
in a recognition result; and

formatting the recognition result based on the
identification.

2. The method according to claim 1, wherein identifying
the non-verbalized punctuation mark includes
predicting the non-verbalized punctuation mark using
at least one text feature and at least one acoustic
feature related to the utterance.

3. The method according to claim 2, wherein the 
acoustic feature includes a period of silence, a func- 
tion of pitch of words near the period of silence, an 
average pitch of words near the period of silence, 
and/or a ratio of an average pitch of words near the 
period of silence. 

4. The method according to any one of the preceding 
claims, wherein formatting the recognition result in- 
cludes controlling or altering spacing relative to the 
non-verbalized punctuation mark, and/or control- 
ling or altering capitalization of words relative to the 
non-verbalized punctuation mark. 

5. The method according to any one of the preceding 
claims, wherein: 

the non-verbalized punctuation mark includes 
a period, and 

formatting the recognition result includes in- 
serting an extra space after the period and cap- 
italizing a next word following the period. 



6. The method according to any one of the preceding
claims, further comprising:

selecting a portion of the recognition result to
be corrected that includes the non-verbalized
punctuation mark; and

correcting the portion of the recognition result
that includes the non-verbalized punctuation
mark with one of a number of correction choices.

7. The method according to claim 6, wherein at least 
one of the correction choices includes a change to 
the non-verbalized punctuation mark. 

8. The method according to claim 6, wherein at least
one of the correction choices does not include the
non-verbalized punctuation mark.

9. An apparatus comprising a computer-readable medium
having instructions stored thereon that when
executed by a machine result in at least the following:

performing speech recognition on an utterance
to produce a recognition result for the utterance;

identifying a non-verbalized punctuation mark
in a recognition result; and

formatting the recognition result based on the
identification.

10. A method of correcting incorrect text associated
with recognition errors in computer-implemented
speech recognition, comprising:

performing speech recognition on an utterance
to produce a recognition result for the utterance,
wherein the recognition result includes
non-verbalized punctuation;

selecting a portion of the recognition result to
be corrected that includes the non-verbalized
punctuation; and

correcting the portion of the recognition result
that includes the non-verbalized punctuation
with one of a number of correction choices.

11. The method according to claim 10, wherein at least
one of the correction choices includes a change to
the non-verbalized punctuation.

12. The method according to claim 10, wherein at least 
one of the correction choices does not include the 
non-verbalized punctuation. 

13. The method according to any one of claims 10 to
12, where the non-verbalized punctuation includes
a non-verbalized punctuation mark.

14. The method according to any one of claims 10 to
13, wherein correcting the portion of the recognition
result includes changing the non-verbalized
punctuation and reformatting text surrounding the
non-verbalized punctuation to be grammatically
consistent with the changed non-verbalized punctuation.

15. The method according to claim 14, wherein chang- 
ing the non-verbalized punctuation and reformat- 
ting text surrounding the non-verbalized punctua- 
tion is in response to a single user action. 

16. An apparatus comprising a computer-readable me- 
dium having instructions stored thereon that when 
executed by a machine result in at least the follow- 
ing: 

performing speech recognition on an utterance 
to produce a recognition result for the utter- 
ance, wherein the recognition result includes 
non-verbalized punctuation; 
selecting a portion of the recognition result to 
be corrected that includes the non-verbalized 
punctuation; and 

correcting the portion of the recognition result 
that includes the non-verbalized punctuation 
with one of a number of correction choices. 

17. A method of recognizing punctuation in computer-
implemented speech recognition dictation, the
method comprising: 

performing speech recognition on an utterance 
to produce a recognition result for the utter- 
ance; 

identifying a non-verbalized punctuation mark 
in a recognition result; and 
determining where to insert the non-verbalized 
punctuation mark within the recognition result 
based on the identification using at least one 
text feature and at least one acoustic feature 
related to the utterance to predict where to in- 
sert the non-verbalized punctuation mark. 

18. The method according to claim 17, wherein the 
acoustic feature includes a period of silence, a func- 
tion of pitch of words near the period of silence, an 
average pitch of words near the period of silence, 
and/or a ratio of an average pitch of words near the 
period of silence. 

19. An apparatus comprising a computer-readable me- 
dium having instructions stored thereon that when 
executed by a machine result in at least the follow- 
ing: 

performing speech recognition on an utterance
to produce a recognition result for the utterance;

identifying a non-verbalized punctuation mark
in a recognition result; and

determining where to insert the non-verbalized
punctuation mark within the recognition result
based on the identification using at least one
text feature and at least one acoustic feature
related to the utterance to predict where to
insert the non-verbalized punctuation mark.

20. A graphical user interface for correcting incorrect 
text associated with recognition errors in computer- 
implemented speech recognition, comprising: 

a window to display a selected recognition result
including non-verbalized punctuation associated
with an utterance; and

a list of recognition alternatives with at least one
of the recognition alternatives including a
change to the non-verbalized punctuation and
associated adjustments in spacing and other
punctuation.

21. The graphical user interface of claim 20, wherein
the non-verbalized punctuation includes a period,
or a comma.

22. The graphical user interface of claim 20 or claim 21,
wherein:

the change to the non-verbalized punctuation
includes a change from a period to a comma,
and

the associated adjustments in spacing and other
punctuation includes removing a space after
the comma and uncapitalizing a word following
the comma.

23. The graphical user interface of claim 20 or claim 21,
wherein:

the change to the non-verbalized punctuation
includes a change from a comma to a period,
and

the associated adjustments in spacing and other
punctuation includes adding a space after
the period and capitalizing a word following the
period.
